Support JSON reader #64830
Merged
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: File scanner v2 did not have a native JSON file reader. This change adds JSON as a supported v2 file format, wires it through table reader creation and hive reader schema mapping, and implements JSON parsing/materialization directly in the v2 reader without delegating to the legacy NewJsonReader path. Unit tests cover line JSON, outer array documents, json_root, jsonpaths, requested column ordering, nullable missing fields, required missing fields, strict malformed JSON errors, and ignore-malformed null rows.
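The missing-field rules listed above (requested column ordering, nullable missing fields, required missing fields) can be illustrated with a minimal sketch. It is not the reader's code: `ColumnSpec`, `ParsedRow`, and `materialize_row` are hypothetical names, the parsed JSON object is flattened to a string map, and an exception stands in for whatever error status the real reader returns.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical column description; not the Doris SlotDescriptor.
struct ColumnSpec {
    std::string name;
    bool nullable;
};

// One parsed JSON object, flattened to key -> string value for illustration.
using ParsedRow = std::map<std::string, std::string>;

// Materialize the requested columns in request order (not in JSON key order).
// A missing key becomes NULL (std::nullopt) for a nullable column and an
// error for a required one.
std::vector<std::optional<std::string>> materialize_row(
        const ParsedRow& row, const std::vector<ColumnSpec>& requested) {
    std::vector<std::optional<std::string>> out;
    out.reserve(requested.size());
    for (const ColumnSpec& col : requested) {
        auto it = row.find(col.name);
        if (it != row.end()) {
            out.emplace_back(it->second);
        } else if (col.nullable) {
            out.emplace_back(std::nullopt);
        } else {
            // Assumption: the real reader reports a status instead of throwing.
            throw std::runtime_error("missing required column: " + col.name);
        }
    }
    return out;
}
```

Under these assumptions, a row `{"a": "1", "b": "2"}` requested as `b, c (nullable), a` yields `"2", NULL, "1"`, while requesting a non-nullable `d` fails.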
### Release note
Support JSON reader in file scanner v2.
### Check List (For Author)
- Test: Unit Test / Manual test
- Added JsonReaderTest coverage for different JSON input scenarios.
- Ran git diff --check.
- Ran build-support/check-format.sh.
- Attempted ./run-be-ut.sh --run --filter='JsonReaderTest.*', but sandbox execution failed because nproc is unavailable, .git/modules submodule config writes are denied, and GitHub dependency download DNS is blocked. Retried with escalated permissions twice, but approval review timed out before execution.
- Behavior changed: Yes. File scanner v2 can create a native JSON reader for FORMAT_JSON.
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add comments to the file scanner v2 JSON reader interfaces and non-obvious implementation paths, including synthetic schema handling, requested column mapping, simdjson buffer lifetime, json_root/jsonpaths behavior, duplicate key handling, and malformed-row rollback.
### Release note
None
### Check List (For Author)
- Test: No need to test (comment-only change)
- Ran git diff --check.
- Ran build-support/check-format.sh.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The JsonReaderTest helper always set a valid null indicator for constructed slot descriptors, so SlotDescriptor treated even non-nullable DataTypeString slots as nullable. This made the missing-required-column test expect an error while the test input actually described a nullable slot. Fix the helper to set nullIndicatorBit to -1 for non-nullable types.
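A minimal sketch of the helper fix described above. `SlotSpec`, `is_nullable`, and `make_slot_spec` are hypothetical stand-ins rather than the Doris descriptor types; the only convention taken from the description is that a null indicator bit of -1 marks a slot as non-nullable.

```cpp
#include <cassert>

// Hypothetical stand-in for the descriptor fields the test helper fills in.
struct SlotSpec {
    int null_indicator_byte;
    int null_indicator_bit;
};

// Assumed convention from the description: a valid (non-negative) bit index
// makes the slot nullable, and -1 means there is no null indicator.
bool is_nullable(const SlotSpec& slot) {
    return slot.null_indicator_bit >= 0;
}

// Fixed helper: only nullable types get a valid null indicator bit.
SlotSpec make_slot_spec(bool type_is_nullable) {
    SlotSpec slot;
    slot.null_indicator_byte = 0;
    slot.null_indicator_bit = type_is_nullable ? 0 : -1;
    return slot;
}
```

With the old behavior (always a valid bit), `is_nullable` would return true for every slot, which is why the missing-required-column test could not see a required column.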
### Release note
None
### Check List (For Author)
- Test: Unit Test / Manual test
- Ran git diff --check.
- Ran build-support/check-format.sh.
- Attempted ./run-be-ut.sh --run --filter='JsonReaderTest.ReturnsErrorForMissingRequiredColumn', but sandbox execution failed because nproc is unavailable, .git/modules submodule config writes are denied, and GitHub dependency download DNS is blocked. Retried with escalated permissions twice, but approval review timed out before execution.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: File scanner v2 JSON reader had three regressions. First, openx_json_ignore_malformed appended an all-null row for malformed records, while Hive/OpenX semantics skip malformed records. Second, empty JSON lines were still passed to simdjson, which rejected them with its EMPTY error. Third, the reader assumed nullable output columns always had nullable source serdes and called get_nested_serdes on scalar serdes, which broke CDC TVF JSON rows whose output file column is nullable but whose source slot serde is not. This change skips malformed rows, treats empty JSON lines as empty rows, and only unwraps serdes when the source type is actually nullable.
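The empty-line handling can be sketched as a check that runs before the parser is invoked. `LineKind` and `classify_line` are hypothetical names, and treating whitespace-only lines as empty is an assumption of this sketch, not something the description states.

```cpp
#include <cassert>
#include <string_view>

enum class LineKind { kEmpty, kNeedsParse };

// Decide whether a line should be handed to the JSON parser at all.
// Assumption: a line holding only spaces, tabs, CR, or LF counts as empty.
LineKind classify_line(std::string_view line) {
    for (char c : line) {
        if (c != ' ' && c != '\t' && c != '\r' && c != '\n') {
            return LineKind::kNeedsParse;
        }
    }
    return LineKind::kEmpty;
}
```

A caller would emit an empty row for `LineKind::kEmpty` and only pass `LineKind::kNeedsParse` lines to the parser, so the parser never sees an empty document.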
### Release note
None
### Check List (For Author)
- Test: Unit Test / Manual test
- Added JsonReaderTest coverage for present required columns, ignored malformed rows, and empty JSON lines.
- Ran git diff --check.
- Ran build-support/check-format.sh.
- Attempted ./run-be-ut.sh --run --filter='JsonReaderTest.*', but sandbox execution failed because nproc is unavailable, .git/modules submodule config writes are denied, and GitHub dependency download DNS is blocked. Retried with escalated permissions twice, but approval review timed out before execution.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The file scanner v2 JSON reader exposed every synthetic file column as nullable. For CDC TVF sources with non-nullable columns, this made the table mapper see Nullable(INT) where the source slot was INT and produced a mapping projection cast failure. This change preserves the source slot nullability in the JSON file schema while keeping the existing runtime handling for nullable output columns.
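A minimal sketch of the schema change, assuming a simplified model in which types are plain strings. `SourceSlot`, `FileColumn`, and `build_file_schema` are hypothetical names, not the reader's API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified views of a source slot and a file schema column.
struct SourceSlot {
    std::string name;
    std::string type;
    bool nullable;
};

struct FileColumn {
    std::string name;
    std::string type;
    bool nullable;
};

// Build the synthetic file schema from the source slots.
std::vector<FileColumn> build_file_schema(const std::vector<SourceSlot>& slots) {
    std::vector<FileColumn> schema;
    schema.reserve(slots.size());
    for (const SourceSlot& slot : slots) {
        // Copy the slot's nullability instead of forcing `true`, so an INT
        // slot maps to an INT column rather than Nullable(INT).
        schema.push_back(FileColumn{slot.name, slot.type, slot.nullable});
    }
    return schema;
}
```

In this model, the table mapper compares like with like, so no cast between `Nullable(INT)` and `INT` is needed for non-nullable CDC TVF columns.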
### Release note
None
### Check List (For Author)
- Test: Unit Test / Manual test
- Updated JsonReaderTest to assert nullable and non-nullable file schema columns.
- Ran git diff --check.
- Ran build-support/check-format.sh.
- Attempted ./run-be-ut.sh --run --filter='JsonReaderTest.*', but sandbox execution failed because nproc is unavailable, .git/modules submodule config writes are denied, and GitHub dependency download DNS is blocked. Retried with escalated permissions twice, but approval review timed out before execution.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The file scanner v2 JSON reader now skips malformed rows when openx_json_ignore_malformed is enabled. The Hive openx JSON regression expected the old behavior that materialized malformed rows as all NULL values, so the expected output failed against the corrected result. This updates the expected q1 output to contain only the valid JSON rows.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check.
- Did not run the external Hive regression locally because the external Hive test environment is not available in this workspace.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: File scanner v2 JSON reader incorrectly skipped malformed JSON documents when openx_json_ignore_malformed was enabled. The existing OpenX JSON reader semantics materialize one all-NULL row for each ignored malformed document when all projected columns are nullable. This change restores that compatibility by rolling back any partial writes and appending a NULL row for malformed documents, and updates the JsonReader unit test and Hive regression expected output accordingly.
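The rollback-then-NULL-row behavior can be sketched as follows. This is an illustration under simplifying assumptions, not the reader's implementation: columns are modeled as vectors of optional strings, all projected columns are assumed nullable, and `rollback_and_append_null_row` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical nullable column: std::nullopt stands for NULL.
using Column = std::vector<std::optional<std::string>>;

// Called when a document turns out to be malformed after some columns may
// already have received a value for the current row.
// `rows_before` is the row count every column had before the row started.
void rollback_and_append_null_row(std::vector<Column>& columns,
                                  std::size_t rows_before) {
    for (Column& col : columns) {
        // Drop any partial write so all columns are the same length again.
        if (col.size() > rows_before) {
            col.resize(rows_before);
        }
        // Materialize the ignored document as NULL in this column.
        col.push_back(std::nullopt);
    }
}
```

Recording the row count before each document and truncating back to it keeps all columns aligned regardless of how far parsing got before the error was detected.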
### Release note
None
### Check List (For Author)
- Test: Unit Test / Manual test
- Ran git diff --check.
- Formatted changed BE C++ files with Homebrew clang-format 16.0.6.
- Attempted ./run-be-ut.sh --run --filter='JsonReaderTest.*' with JDK17; sandbox execution failed because nproc is unavailable, submodule config writes are denied, and GitHub dependency download DNS is blocked. Escalated retries timed out before execution.
- Behavior changed: No
- Does this need documentation: No
Merged commit 267891d into apache:refact_reader_branch. 31 of 36 checks passed.
Gabriel39 added a commit that referenced this pull request on Jun 26, 2026.